1. Introduction

It is desired to keep a satellite close to a fixed point
in space when it is subject to random forces. Fuel may be used
to accelerate it at appropriate times. The permitted acceleration
is bounded. While it may be desirable to maximize the
performance for a given amount of fuel, we shall consider the
fuel as available in unlimited amounts at a fixed cost per unit.


Our object will be to minimize the long run expected average
cost per unit time, assuming some cost for being away from
target. The random forces are modeled by Brownian motion. In
this discussion we shall treat the position in space as
one-dimensional and given by $x_1$, and the cost of being at $x_1$ is
$x_1^2$ per unit time. Let $x_2$ be the velocity, so that
$x = x(t) = (x_1(t), x_2(t))$ describes the current state of the
satellite, which is subject to the laws

(1.1)  $dx_1 = x_2\,dt$

(1.2)  $dx_2 = u\,dt + \sigma\,dw(t), \qquad |u| \le u_0$

where $w(t)$ represents standard Brownian motion with $E[dw(t)] = 0$
and $E[dw^2(t)] = dt$, and where $u$ represents the acceleration,
which is subject to control depending on $t$ and past history.

The running cost per unit time is given by $r(x,u)$ where

(1.3)  $dR(t) = (c|u| + x_1^2)\,dt = r(x,u)\,dt$


We shall seek a policy $\mathcal{P}$ which minimizes

(1.4)  $\gamma = \limsup_{T \to \infty} E\{C(x,0,T)/T\}$

where $C(x,t,T)$ is the total cumulated cost over the time
period $(t,T)$ when the state process $X(t')$, $t \le t' \le T$,
originates at $X(t) = x$ at time $t$. The random process
$X(t')$ is dependent on the policy $\mathcal{P}$ and hence the expectation
involves $\mathcal{P}$ implicitly. When there is a possibility of
confusion, the policy $\mathcal{P}$ will be indicated as a superscript
on the expectation, e.g. $E^{\mathcal{P}} C(x,t,T)$.

Intuitive considerations show that using an acceleration
of $u$ for a time interval $dt$ and 0 for $dt$ is equivalent
to using $u/2$ for $2\,dt$. Thus the search for an optimal policy
may reasonably be confined to policies which use only the values
$u = \pm u_0$ and 0. For a stationary policy where $u = u(x_1,x_2)$,
independent of $t$ and past history (with some minor abuse
of notation), the policy can be described by dividing up the
$(x_1,x_2)$ state space into three regions $A_+$, $A_-$ and $A_0$ where
$u = +u_0$, $-u_0$ and 0 respectively.


The  main  object  of  this  paper  is  to  describe  a simple 
numerical approach for deriving and evaluating an optimal
policy.  The  basic  method  is  to  apply  backward  induction  to 
the  Markov  Decision  problem  that  is  formed  from  approximating 
the  continuous  space  time  problem  described  above  by  a bounded 
discrete  space,  discrete  time  version  of  the  problem.  A 
similar  approach  was  applied  by  Kushner  and  Kleinman  [6  , 1 , 8]. 
The  main  difference  in  this  approach  is  in  the  method  of  handling 
the  edge  effects  at  the  boundary  of  the  bounded  space. 

Since the method is iterative and uses an initial approximation,
one was obtained from an approximation to the solution
of the deterministic version of the original problem where
there are no random forces. In Section 2 the cost associated
with a special suboptimal policy in the deterministic problem
is evaluated and optimality conditions are introduced to explain
how this candidate was selected and why it is suboptimal. In
Section  3 the  optimal  policy  for  the  deterministic  problem  is 
described. 

For both the deterministic and stochastic versions of the
problem, the homogeneity of the cost functions permits one to
standardize the problem and effectively eliminate two of the
parameters $c$, $u_0$ and $\sigma$ by applying linear transformations
to the variables $x_1$, $x_2$, $t$, $u$ and $w$. These transformations
are described in Section 4.


After  discussion  of  the  relationship  between  the  solution 
and  bounds  on  the  solution  and  a free  boundary  problem  in 
Section  5 the  discrete  approximation  to  the  problem  is 
described  in  Sections  6,  7 and  8.  Sections  9 and  10  are 
devoted  to  several  alternative  treatments  of  edge  effects. 

Some  miscellaneous  remarks  appear  in  Section  11  and  finally 
Section  12  presents  some  results  of  preliminary  computations. 
2.  The  Deterministic  Version  - A Suboptimal  Policy 

If there is no random noise, i.e., $\sigma = 0$, it is easy to
find control policies which bring the satellite to $x_1 = 0$
with zero velocity in a finite time interval. Thus, for the
deterministic version of the problem it is reasonable to
consider simply the total cumulated cost $V(x_1,x_2)$ associated
with a given policy and to minimize that.

It is clear that an optimal policy for the deterministic
problem will correspond to decomposing the state space into
three sets $A_+$, $A_0$, $A_-$ on which $u = u_0$, 0 and $-u_0$
respectively.

For one of our numerical procedures for the stochastic
problem we shall make use of an approximation to the optimal
solution of the deterministic problem. The precision of our
approximation is not crucial and so we shall use a moderately
convenient approximation. In this section we describe that
suboptimal procedure and compute its cumulated cost $V(x_1,x_2)$.
The source of this policy, as well as an explanation of why it
is suboptimal in terms of conditions of optimality, will
conclude this section. We shall follow this in Section 3
by a brief description of the optimal policy for the special
values of the parameters $u_0 = c = 1$.

The suboptimal policy is described in terms of three
sets $A_+$, $A_0$ and $A_-$ where one applies $u = u_0$, 0 and
$-u_0$ respectively (see Figure 1). These in turn may be
expressed in terms of two curves $C_0$ and $C^*$ and their
reflections about the origin:

(2.1)  $C_0 = \{(x_1,x_2) : x_1 = x_2^2/(2u_0),\ x_2 \le 0\}$

(2.2)  $C^* = \{(x_1,x_2) : x_1^3 = 6c\,x_2^2 + x_2^6/(8u_0^3),\ x_2 \le 0\}$

The set $A_0$ consists of the region between $C_0$ and $C^*$ and
its reflection. More precisely, $(x_1,x_2) \in A_0$ if $x_2 \le 0$ and
$x_2^2/(2u_0) \le x_1 < [6c\,x_2^2 + x_2^6/(8u_0^3)]^{1/3}$, or if $(-x_1,-x_2)$
satisfies these two inequalities. The set $A_-$ consists of
the region between $C^*$ and the reflection of $C_0$, and $A_+$
is the remaining part of the plane.

Under this policy and the laws of motion

(2.3)  $dx_1 = x_2\,dt$

(2.4)  $dx_2 = u\,dt$

a satellite whose state $(x_1,x_2)$ is in $A_-$ will move to
$C^*$ under $u = -u_0$, following a parabolic path where
$x_1 + x_2^2/(2u_0)$ is constant. From $C^*$ it will move to $C_0$
keeping $x_2$ fixed. Once it hits the parabolic path $C_0$,
$u = u_0$ and it follows $C_0$ to the origin. We compute $V$ for
this policy by retracing the path of the satellite from the
origin to $(x_1,x_2)$.

If $(x_1,x_2) \in C_0$,

$V(x_1,x_2) = V^{(0)}(x_1,x_2) := \int_x^0 (c u_0 + x_1^2)\,dt' = \int_{x_2}^0 \left(c u_0 + \frac{x_2'^4}{4u_0^2}\right) \frac{dx_2'}{u_0}$

(The first integral is a line integral along $C_0$ from
$x = (x_1,x_2)$ to $(0,0)$.)

(2.5)  $V^{(0)}(x_1,x_2) = -c\,x_2 - x_2^5/(20u_0^3)$
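
As a quick numerical check of (2.5), the following sketch compares the closed form with a midpoint-rule evaluation of the integral over $x_2'$ along $C_0$. The function names, the parameter values and the step count are illustrative assumptions, not part of the original report.

```python
def v0_closed_form(x2, c=1.0, u0=1.0):
    """Closed form (2.5) for the cost along C_0; intended for x2 <= 0."""
    return -c * x2 - x2**5 / (20.0 * u0**3)


def v0_by_quadrature(x2, c=1.0, u0=1.0, steps=20000):
    """Midpoint rule for the integral of (c*u0 + s^4/(4*u0^2)) ds/u0 over [x2, 0].

    On C_0 we have x1 = s^2/(2*u0) and dt = ds/u0 when u = u0.
    """
    h = (0.0 - x2) / steps
    total = 0.0
    for k in range(steps):
        s = x2 + (k + 0.5) * h
        total += (c * u0 + s**4 / (4.0 * u0**2)) / u0 * h
    return total


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    print(v0_closed_form(-2.0, c=1.5, u0=0.8))
    print(v0_by_quadrature(-2.0, c=1.5, u0=0.8))
```

The two printed values should agree to several decimal places; the quadrature is only a sanity check and plays no role in the method itself.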

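If $(x_1,x_2) \in A_0$ with $x_2 < 0$,

$V(x_1,x_2) = V^{(1)}(x_1,x_2) := V^{(0)}(x_{10},x_{20}) + \int_x^{x_0} x_1^2\,dt'$

where $x_0 = (x_{10},x_{20})$ is the state at which the satellite
originating at $(x_1,x_2) \in A_0$ first intersects $C_0$. Then

(2.6)  $V^{(1)}(x_1,x_2) = -c\,x_2 - x_2^5/(120u_0^3) - x_1^3/(3x_2)$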


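If $(x_1,x_2) \in A_-$,

$V(x_1,x_2) = V^{(2)}(x_1,x_2) := V^{(1)}(x_1^*,x_2^*) + \int_x^{x^*} (c u_0 + x_1^2)\,dt'$

where $x^* = (x_1^*,x_2^*)$ is the point on $C^*$ which the satellite first reaches
from $x \in A_-$. Then $x^*$ is determined by $x_1^{*3} = 6c\,x_2^{*2} + x_2^{*6}/(8u_0^3)$
and $x_1^* + x_2^{*2}/(2u_0) = a := x_1 + x_2^2/(2u_0)$. Then

(2.7)  $V^{(2)}(x_1,x_2) = V^{(1)}(x_1^*,x_2^*) + \left(c + \frac{a^2}{u_0}\right)(x_2 - x_2^*) - \frac{a}{3u_0^2}(x_2^3 - x_2^{*3}) + \frac{x_2^5 - x_2^{*5}}{20u_0^3}$

For $(x_1,x_2) \in A_+$ and that part of $A_0$ where $x_2 > 0$, we
may apply symmetry to obtain $V(x_1,x_2) = V(-x_1,-x_2)$.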


Having computed $V$, we may ask how $C^*$ was selected
and why $V$ is not optimal. First we deal with conditions for
optimality.

Let $\mathcal{P}$ be a policy which assigns control value $u = u(x_1,x_2,t)$.
Let $V(x_1,x_2,t)$ be the cumulated cost associated with $\mathcal{P}$ from
time $t$ on if $x(t) = (x_1,x_2)$. Let

(2.8)  $\mathcal{A}_u V = x_2 V_{x_1} + u V_{x_2}$

Note that $\mathcal{A}_u$ depends on $\mathcal{P}$ since it involves the control $u$.
If $V < \infty$, the laws of motion (2.3), (2.4) imply that along
the path prescribed by $\mathcal{P}$, $V$ changes by
$dV = \mathcal{A}_u V\,dt = (x_2 V_{x_1} + u V_{x_2})\,dt$ and hence

(2.9)  $x_2 V_{x_1} + u V_{x_2} + c|u| + x_1^2 = 0$

which may be interpreted as

(2.9a)  $dV + dR = 0,$

where $dR = (c|u| + x_1^2)\,dt = r(x,u)\,dt$ and $r$ is the running
cost. For a policy defined by $A_+$, $A_0$ and $A_-$ as the one
considered above, $u$ assumes the values $u_0$, 0 and $-u_0$
respectively on these sets. For a stationary or time
independent policy where $u = u(x,t)$ is independent of $t$,
$V$ is independent of $t$.

Suppose now that $V = V(x_1,x_2)$ is a function vanishing
at the origin and with the property that no matter what
policy is followed,

(2.10a)  $dV + dR \ge 0.$

Integrating along the path followed by a policy $\mathcal{P}^*$ which
drives the satellite from $(x_1,x_2)$ to $(0,0)$ we have

$V(0,0) - V(x_1,x_2) + \int dR = V(0,0) - V(x_1,x_2) + V^* \ge 0$

where $V^*$ is the cumulated cost for $\mathcal{P}^*$. Since $V(0,0) = 0$,

(2.11)  $V^* \ge V(x_1,x_2).$

If there is also a policy $\mathcal{P}$ for which $dV + dR = 0$ it
would follow that $V$ is the cost for $\mathcal{P}$ and $\mathcal{P}$ is optimal.
Thus the optimality of $\mathcal{P}$ would follow if we could show in
addition to (2.9)

(2.10)  $x_2 V_{x_1} + u V_{x_2} + c|u| + x_1^2 \ge 0$

for all $|u| \le u_0$ and all $x$.


In effect we have proved

Theorem 2.1  For any policy $\mathcal{P}$ with finite cumulated cost,
$dV + dR = 0$. If $V$ is the cumulated cost associated with a
stationary policy $\mathcal{P}$ and $dV + dR \ge 0$ for each policy $\mathcal{P}^*$,
then $\mathcal{P}$ is optimal.

In our application we can combine equations (2.9) and
(2.10) and we see that optimality of a policy defined by
$A_+$, $A_0$ and $A_-$ requires

(2.12)  $V_{x_2} \le -c$ on $A_+$,  $|V_{x_2}| \le c$ on $A_0$,  $V_{x_2} \ge c$ on $A_-$
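
Condition (2.12) is what one obtains by minimizing the left side of (2.10) over the three admissible control values, since only the terms $u V_{x_2} + c|u|$ depend on $u$. A minimal sketch of that pointwise minimization follows; the function name is a hypothetical helper introduced here for illustration.

```python
def minimizing_control(v_x2, c, u0):
    """Return the u in {-u0, 0, +u0} that minimizes u*v_x2 + c*|u|.

    Ties are resolved in favor of the earlier candidate in the tuple
    (-u0, 0, +u0); this tie rule is an arbitrary choice of the sketch.
    """
    candidates = (-u0, 0.0, u0)
    return min(candidates, key=lambda u: u * v_x2 + c * abs(u))


if __name__ == "__main__":
    print(minimizing_control(-2.0, 1.0, 1.0))  # V_x2 < -c: accelerate with +u0
    print(minimizing_control(0.5, 1.0, 1.0))   # |V_x2| < c: coast
    print(minimizing_control(3.0, 1.0, 1.0))   # V_x2 > c: accelerate with -u0
```

The three calls correspond to points in $A_+$, $A_0$ and $A_-$ respectively.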

We now return to the policy proposed at the beginning of
this section. Given our decision to apply $u = u_0$ on $C_0$
and to include in $A_0$ points in the fourth quadrant immediately
to the right of $C_0$, we have $V = V^{(1)}$ for those points. But
for such points $|V_{x_2}| = |-c - x_2^4/(24u_0^3) + x_1^3/(3x_2^2)| \le c$ as
long as $x_1^3 \ge x_2^6/(8u_0^3)$ and $x_1^3 \le 6c\,x_2^2 + x_2^6/(8u_0^3)$.
Thus, given our decision to apply $u = u_0$ on $C_0$, the choice
of $C^*$ for the boundary of $A_0$ is optimal.
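
The claim that $V_{x_2}$ runs from $-c$ on $C_0$ to $+c$ on $C^*$ can be checked numerically. The sketch below builds the cost in $A_0$ directly from its definition (coast at fixed $x_2 < 0$ to $C_0$, then follow $C_0$ using (2.5)) and differentiates it by central differences. Function names, the step size and the parameter values are assumptions made for illustration.

```python
def v1(x1, x2, c=1.0, u0=1.0):
    """Cost in A_0 for x2 < 0: coast with u = 0 to C_0, then follow C_0."""
    x10 = x2 * x2 / (2.0 * u0)                   # point where the coasting path meets C_0
    v_on_c0 = -c * x2 - x2**5 / (20.0 * u0**3)   # equation (2.5)
    coasting = (x10**3 - x1**3) / (3.0 * x2)     # integral of x1^2 dt with dt = dx1/x2
    return v_on_c0 + coasting


def dv1_dx2(x1, x2, c=1.0, u0=1.0, h=1e-5):
    """Central-difference estimate of the partial derivative of v1 in x2."""
    return (v1(x1, x2 + h, c, u0) - v1(x1, x2 - h, c, u0)) / (2.0 * h)


if __name__ == "__main__":
    c, u0, x2 = 2.0, 1.5, -1.2                   # hypothetical values
    x1_on_c0 = x2**2 / (2.0 * u0)
    x1_on_cstar = (6.0 * c * x2**2 + x2**6 / (8.0 * u0**3)) ** (1.0 / 3.0)
    print(dv1_dx2(x1_on_c0, x2, c, u0))          # close to -c
    print(dv1_dx2(x1_on_cstar, x2, c, u0))       # close to +c
```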


However, if we study $V_{x_2}$ on $A_-$ as we retrace the
path of a satellite from $C^*$ along the parabola

$x_1 + x_2^2/(2u_0) = a = x_1^* + x_2^{*2}/(2u_0)$

it is possible to see, by calculations similar to those to be
given in Section 3, that as $x_2$ increases from $x_2^*$, $V_{x_2}$
increases from $c$ at first but eventually decreases below
$c$ in part of $A_-$ in the second quadrant. But optimality
would demand that that part of $A_-$ should be in $A_0$, and
our policy fails to satisfy the optimality conditions (2.12).

An intuitive explanation is that by using $u = u_0$ on
$C_0$, we slow down the satellite while $x_1$ is large and
accumulate a large cost by staying in a region of large $x_1$
too long. Apparently it is preferable to pass through $C_0$
before slowing down. While this means we overshoot the $x_1 = 0$
target, we do so where $x_1$ is relatively small and the additional
cost incurred from having to retrace our path is less than that
of tarrying too long in a region of large $x_1$.

Although  the  policy  of  this  section  is  suboptimal,  it 
resembles  the  optimal  policy  sufficiently  to  serve  as  a useful 
device  for  the  numerical  analysis  of  the  stochastic  problem. 

It is of some interest to repeat that the optimality
condition is violated when $x_2 > 0$ only if $x_1 < 0$. Hence
this policy is optimal if we add the restriction that $x_1 \ge 0$.
Thus it represents a deterministic solution to a problem of
a soft landing on a planet from a rocket stationed vertically
over the point of impact. Of course the force of gravity must
be assumed to be constant and this particular solution is
meaningful only for a cost function which is rather peculiar
in the soft landing application. Shepp studied a similar
deterministic problem where the cost was that of fuel and
the time to reach the target. In that problem, the optimal
policy does use $C_0$ but $C^*$ is replaced by

$\{x^* : x_1^* = (1 + 4cu_0)\,x_2^{*2}/(2u_0)\}$


3.  The  Deterministic  Version  - Optimal  Policy 

In the interest of simplicity let us consider the
deterministic problem for the case $c = u_0 = 1$. There is no real
loss of generality since a linear transformation of the
parameters and state variables, to be described in Section 4, permits
us to normalize our problem to this standard case.

Heuristic considerations suggest that the optimal policy
may be described by $A_0$, $A_-$ and $A_+$ bounded by new curves
$C_0$, $C^*$ in the fourth quadrant and their reflections about the
origin. The cost $V(x_1,x_2)$ associated with the optimal policy
will satisfy

$x_2 V_{x_1} + u V_{x_2} + |u| + x_1^2 = 0$

with $u = 0$, $-1$ and $+1$ on $A_0$, $A_-$ and $A_+$ respectively.
Moreover, the optimality conditions (2.12) require
$|V_{x_2}| \le 1$ on $A_0$, $V_{x_2} \ge 1$ on $A_-$ and $V_{x_2} \le -1$ on $A_+$.

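Given a point $(x_{10},x_{20})$ on $C_0$ where $V_{x_2} = -1$ and
$x_2 V_{x_1} = -x_1^2$, we compute $V$ backwards from this point
along the path of points going to $C_0$,

$V(x_1,x_2) = V(x_{10},x_{20}) + \int_x^{x_0} x_1^2\,dt' = V(x_{10},x_{20}) + (x_{10}^3 - x_1^3)/(3x_2)$

where $x_0 = (x_{10},x_{20})$ and $x_2 = x_{20}$, so that

$V_{x_2} = \frac{\partial V(x_{10},x_{20})}{\partial x_{10}} \frac{\partial x_{10}}{\partial x_2} + \frac{\partial V(x_{10},x_{20})}{\partial x_{20}} + \frac{x_{10}^2}{x_2} \frac{\partial x_{10}}{\partial x_2} - \frac{x_{10}^3 - x_1^3}{3x_2^2}$

But since $x_2 V_{x_1} + x_1^2 = 0$ on $A_0$ and $V_{x_2} = -1$ on $C_0$,

$V_{x_2} = -1 - (x_{10}^3 - x_1^3)/(3x_2^2).$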


As we retrace the paths from $(x_{10},x_{20})$, $x_2$ remains fixed at
$x_{20}$ and $x_1$ increases. Thus $V_{x_2}$ increases from $-1$ to
$+1$ when $x_1^3 = x_{10}^3 + 6x_{20}^2$. Thus the boundary $C^*$, determined
by the optimality conditions and $C_0$, is

(3.1)  $C^* = \{x^* : x_1^{*3} = x_{10}^3 + 6x_{20}^2,\ x_2^* = x_{20}\}$


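Given a point $(x_1^*,x_2^*)$ on $C^*$ we compute $V$ for
points on the path $x_1 + x_2^2/2 = a = x_1^* + x_2^{*2}/2$ which
leads to $(x_1^*,x_2^*)$:

$V(x_1,x_2) = V(x_1^*,x_2^*) + \int_x^{x^*} (1 + x_1^2)\,dt'$

$\qquad = V(x_1^*,x_2^*) + \int_{x_2^*}^{x_2} \left[1 + \left(a - \frac{x_2'^2}{2}\right)^2\right] dx_2'$

$\qquad = V(x_1^*,x_2^*) + (1 + a^2)(x_2 - x_2^*) - \frac{a}{3}(x_2^3 - x_2^{*3}) + \frac{x_2^5 - x_2^{*5}}{20}$

$V_{x_2} = \frac{\partial V(x_1^*,x_2^*)}{\partial x_1^*} \frac{\partial x_1^*}{\partial x_2} + \frac{\partial V(x_1^*,x_2^*)}{\partial x_2^*} \frac{\partial x_2^*}{\partial x_2} + \left[1 + \left(a - \frac{x_2^2}{2}\right)^2\right]$

$\qquad - \left[1 + \left(a - \frac{x_2^{*2}}{2}\right)^2\right] \frac{\partial x_2^*}{\partial x_2} + \left[2a(x_2 - x_2^*) - \frac{1}{3}(x_2^3 - x_2^{*3})\right] \frac{\partial a}{\partial x_2}$

Substituting $\partial V(x_1^*,x_2^*)/\partial x_1^* = -x_1^{*2}/x_2^*$,
$\partial V(x_1^*,x_2^*)/\partial x_2^* = 1$, $a = x_1^* + x_2^{*2}/2 = x_1 + x_2^2/2$, and
$\partial a/\partial x_2 = \partial x_1^*/\partial x_2 + x_2^*\,\partial x_2^*/\partial x_2 = x_2$, we have

$V_{x_2} = -\frac{x_1^{*2}}{x_2^*} \frac{\partial x_1^*}{\partial x_2} + \frac{\partial x_2^*}{\partial x_2} + 1 + \left(a - \frac{x_2^2}{2}\right)^2 - (1 + x_1^{*2}) \frac{\partial x_2^*}{\partial x_2} + x_2\left[2a(x_2 - x_2^*) - \frac{1}{3}(x_2^3 - x_2^{*3})\right]$

(3.2)  $\qquad = H(x_2) := 1 - \frac{x_1^{*2}}{x_2^*}\,x_2 + \left(a - \frac{x_2^2}{2}\right)^2 + x_2\left[2a(x_2 - x_2^*) - \frac{1}{3}(x_2^3 - x_2^{*3})\right]$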


As we retrace the path from $(x_1^*,x_2^*)$, $a$ remains fixed but
$x_2$ increases. As a polynomial in $x_2$, $H$ increases from 1
at $x_2 = x_2^*$ but eventually decreases again since the
coefficient of $x_2^4$ is $(1/4) - (1/3) = -1/12$. Let $\tilde{C}$ be
the curve of points $(\tilde{x}_1,\tilde{x}_2)$ for which $\tilde{x}_2$ is the first $x_2 > x_2^*$ where
$H(x_2) = 1$ and $\tilde{x}_1 + \tilde{x}_2^2/2 = a$. It is easy to see that $\tilde{C}$ is
in the second quadrant since

$H'(x_2) = -(x_1^{*2}/x_2^*) + 2a(x_2 - x_2^*) - (x_2^3 - x_2^{*3})/3$

is positive at $x_2 = x_2^* < 0$ and

$H''(x_2) = 2a - x_2^2 = 2x_1$

is positive as long as $x_1$ is positive. Thus $H(x_2) > 1$ for


(3.5) 


Xj^Q  - 0.44462  X2  , *^*10  * 0-54300x2 


Z 0.44462  X2^  , /2x^*  » 0.94300  X2. 


Moreover 


(3.6)  $\tilde{x}_2 \approx -4.13016\,x_2^*, \qquad \tilde{x}_1 \approx -7.58449\,x_2^{*2}.$

Applying (3.1), $x_1^* = (x_{10}^3 + 6x_2^2)^{1/3} \approx x_{10}(1 + 2x_2^2 x_{10}^{-3})$,
we have

(3.7)  $x_1^* - x_{10} \approx 10.11685\,x_2^{-2}.$


Figure 1 shows the optimal region $A_0$ obtained by starting from
$x_0$ with $x_{10} = x_{20}^2/2$ for small $x_{20}$ and computing a sequence
of successive values of $x^*$, $\tilde{x}$ and $x_0 = -\tilde{x}$. The same calculation
with initial point $x_{10} = 0$ leads to almost identical points
when the initial $x_{20}$ is small.
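
The numerical constants quoted above can be cross-checked against (3.1) and (3.2). The sketch below rests on an assumption that is not spelled out in the surviving text: that the constants describe the regime of large $|x_2|$, where $C_0$ behaves like $x_1 = k x_2^2$, where $C^*$ nearly coincides with it (the gap in (3.7) decays like $x_2^{-2}$), and where $\tilde{C}$ is the reflection of $C_0$. Under that assumption, with $\lambda = \tilde{x}_2/x_2^*$, the reflection condition gives $\lambda^2 = (1/2 + k)/(1/2 - k)$ and the degree 4 part of $H(x_2) - 1$ must vanish. The function names and the bisection bracket are illustrative choices.

```python
import math


def residual(k):
    """Leading part of H(x2) - 1 from (3.2), divided by x2*^4, at x2 = lam * x2*.

    Assumes x1* = k * x2*^2 and that (x1~, x2~) lies on the reflection of C_0.
    Returns the residual together with lam.
    """
    lam = -math.sqrt((0.5 + k) / (0.5 - k))
    big_a = k + 0.5                      # a / x2*^2
    value = (-k * k * lam
             + (big_a - lam * lam / 2.0) ** 2
             + 2.0 * big_a * lam * (lam - 1.0)
             - lam * (lam ** 3 - 1.0) / 3.0)
    return value, lam


def solve_k(lo=0.40, hi=0.46, iters=60):
    """Bisection for the root of residual(k) in [lo, hi] (positive at lo, negative at hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid)[0] > 0.0:
            lo = mid
        else:
            hi = mid
    k = 0.5 * (lo + hi)
    return k, residual(k)[1]


if __name__ == "__main__":
    k, lam = solve_k()
    print(k, lam, -k * lam * lam, 2.0 / (k * k))
```

If the stated assumption holds, the four printed numbers should be close to the coefficients 0.44462, -4.13016, -7.58449 and 10.11685 that appear in (3.5) to (3.7).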


4.  Transformations.


The homogeneous nature of the cost $x_1^2$ permits one to
normalize both the stochastic and deterministic versions of the
problem by means of simple linear transformations. This normalization
effectively reduces the number of parameters that need to be
considered by two and is of considerable convenience although
not of fundamental importance.

We start with the deterministic problem. Let

(4.1)  $x_1^* = a_1 x_1, \quad x_2^* = a_2 x_2, \quad t^* = a_3 t, \quad u^* = a_4 u.$

Then applying (1.1)-(1.3) we have

$dx_1^* = a_1 x_2\,dt = a_1 a_2^{-1} a_3^{-1} x_2^*\,dt^*$

$dx_2^* = a_2 u\,dt = a_2 a_4^{-1} a_3^{-1} u^*\,dt^*, \qquad |u^*| \le a_4 u_0$

$dR = (c\,a_4^{-1} |u^*| + a_1^{-2} x_1^{*2})\,a_3^{-1}\,dt^* = a_1^{-2} a_3^{-1} [c\,a_4^{-1} a_1^2 |u^*| + x_1^{*2}]\,dt^* = a_1^{-2} a_3^{-1}\,dR^*.$

If we set $a_1 a_2^{-1} a_3^{-1} = 1$, $a_2 a_4^{-1} a_3^{-1} = 1$, $u_0^* = a_4 u_0$ and
$c^* = c\,a_4^{-1} a_1^2$, we have

(4.2)  $a_1 = \left(\frac{c^* u_0^*}{c\,u_0}\right)^{1/2}, \quad a_2 = \left(\frac{c^* u_0^{*3}}{c\,u_0^3}\right)^{1/4}, \quad a_3 = \left(\frac{c^* u_0}{c\,u_0^*}\right)^{1/4}, \quad a_4 = \frac{u_0^*}{u_0}$

and our problem is now in the original form except that $c$ and
$u_0$ have been replaced by $c^*$ and $u_0^*$ and the cost

(4.3)  $R(x_1,x_2;c,u_0) = \left(\frac{c^5 u_0^3}{c^{*5} u_0^{*3}}\right)^{1/4} R^*(a_1 x_1, a_2 x_2; c^*, u_0^*)$


If we wish we can normalize the starred version by setting
$c^* = u_0^* = 1$, in which case the solution of the original
problem can be expressed in terms of that of the normalized one.

The stochastic version of the problem is a little more
complicated. Here we apply


(4.4)  $x_1^* = a_1 x_1, \quad x_2^* = a_2 x_2, \quad t^* = a_3 t, \quad u^* = a_4 u, \quad w^* = a_5 w$

to (1.1)-(1.3). Proceeding as before we have

$dx_1^* = a_1 a_2^{-1} a_3^{-1} x_2^*\,dt^*$

$dx_2^* = a_2 a_3^{-1} a_4^{-1} u^*\,dt^* + a_2 \sigma a_5^{-1}\,dw^*, \qquad |u^*| \le a_4 u_0$

$dR = c\,a_4^{-1} a_3^{-1} |u^*|\,dt^* + a_1^{-2} a_3^{-1} x_1^{*2}\,dt^* = a_1^{-2} a_3^{-1} [c\,a_1^2 a_4^{-1} |u^*| + x_1^{*2}]\,dt^*$

$E(dw^{*2}) = a_5^2\,dt = a_5^2 a_3^{-1}\,dt^*$

Our problem is left invariant except for the transformation of
$u_0$, $\sigma$ and $c$ to $u_0^*$, $\sigma^*$ and $c^*$ if $a_1 a_2^{-1} a_3^{-1} = 1$,
$a_2 a_3^{-1} a_4^{-1} = 1$, $a_2 a_5^{-1} \sigma = \sigma^*$, $a_4 u_0 = u_0^*$, $a_5^2 a_3^{-1} = 1$ and
$c^* = c\,a_1^2 a_4^{-1}$. Thus, for given $u_0^*$ and $\sigma^*$, select

(4.5)  $a_1 = \frac{\sigma^{*4} u_0^3}{\sigma^4 u_0^{*3}}, \quad a_2 = \frac{\sigma^{*2} u_0}{\sigma^2 u_0^*}, \quad a_3 = \frac{\sigma^{*2} u_0^2}{\sigma^2 u_0^{*2}}, \quad a_4 = \frac{u_0^*}{u_0}, \quad a_5 = \frac{\sigma^* u_0}{\sigma u_0^*}, \quad c^* = \frac{c\,\sigma^{*8} u_0^7}{\sigma^8 u_0^{*7}}.$

Finally the cumulated cost $R(u_0,\sigma,c,t)$ over the time
period $(0,t)$ is transformed by

(4.6)  $R(u_0,\sigma,c,t) = \frac{\sigma^{10} u_0^{*8}}{\sigma^{*10} u_0^8}\, R^*(u_0^*,\sigma^*,c^*,t^*)$

As an illustration to which we shall refer later, if
$u_0 = \sigma = c = 1$ and we set $u_0^* = 1$ and $\sigma^* = 2$, then $a_1 = 16$,
$a_2 = 4$, $a_3 = 4$, $a_4 = 1$, $a_5 = 2$ and $c^* = 256$.
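
The scale factors in (4.5) follow mechanically from the five invariance conditions, which the short sketch below solves in order. The function name and argument order are hypothetical; the calculation itself only restates (4.5).

```python
def scale_factors(u0, sigma, c, u0_star, sigma_star):
    """Solve the invariance conditions preceding (4.5) for a1..a5 and c*.

    Conditions used: a4*u0 = u0*, a2*sigma/a5 = sigma*, a5^2 = a3,
    a2 = a3*a4, a1 = a2*a3, c* = c*a1^2/a4.
    """
    a4 = u0_star / u0
    a5 = sigma_star * u0 / (sigma * u0_star)
    a3 = a5 ** 2
    a2 = a3 * a4
    a1 = a2 * a3
    c_star = c * a1 ** 2 / a4
    return a1, a2, a3, a4, a5, c_star


if __name__ == "__main__":
    # The illustration from the text: u0 = sigma = c = 1, u0* = 1, sigma* = 2.
    print(scale_factors(1.0, 1.0, 1.0, 1.0, 2.0))
```

For the illustration in the text this reproduces $a_1 = 16$, $a_2 = a_3 = 4$, $a_4 = 1$, $a_5 = 2$ and $c^* = 256$, so a unit step in $x_1^*$, $x_2^*$, $t^*$ corresponds to steps of 1/16, 1/4, 1/4 in the original variables.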

5.  Stochastic Control Problem and Free Boundary Problem.

Given a policy $\mathcal{P}$ and an initial value $x = X(t_0)$ of
the state at time $t_0$, the state $X(t)$ at time $t$ has a
corresponding probability distribution. The cumulated cost over
$(t_0,t_1]$ is given by

(5.1)  $C(x,t_0,t_1) = \int_{t_0}^{t_1} dR(t) = \int_{t_0}^{t_1} r[X(t),u(t)]\,dt$


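The heuristic assumption that for a stationary policy $\mathcal{P}$

(5.2)  $E^{\mathcal{P}}\{C(x,t_0,t_1)\} = \gamma(t_1 - t_0) + v(x) + o(1)$

as $t_1 - t_0 \to \infty$, together with $C(x,t,t_1) = dR + C(X(t+dt), t+dt, t_1)$,
suggests

(5.3)  $E^{\mathcal{P}}\{dv + dR\} = \gamma\,dt$

where

$E^{\mathcal{P}}\{dv\} = E^{\mathcal{P}}\{v[X(t+dt)] \mid X(t) = x\} - v(x) = \left(x_2 v_{x_1} + u v_{x_2} + \frac{\sigma^2}{2} v_{x_2 x_2}\right) dt,$

that is,

(5.3')  $x_2 v_{x_1} + u v_{x_2} + \frac{\sigma^2}{2} v_{x_2 x_2} + c|u| + x_1^2 = \gamma.$

If we define

(5.4)  $\mathcal{A}_u v = x_2 v_{x_1} + u v_{x_2} + \frac{\sigma^2}{2} v_{x_2 x_2}$

then (5.3') may be written as

(5.3'')  $\mathcal{A}_u v + r(x,u) = \gamma$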


Bather [1] called the function $v$, first introduced by
Howard [5], the potential function. It has also been called
the value difference function. Thus if $\mathcal{P}$ is a stationary
policy which imposes $u = 0$, $-u_0$, $u_0$ in $A_0$, $A_-$, $A_+$,
Equation (5.3') converts into separate equations in each
region. The heuristic reasoning leads to a more solid
interpretation if we introduce a truncated version of our problem
which terminates at time $t_1$ with terminal cost $v[X(t_1)]$.
The expected cost of $\mathcal{P}$ for this problem is

(5.5)  $D_v^{\mathcal{P}}(x,t_0,t_1) = E^{\mathcal{P}} C(x,t_0,t_1) + E^{\mathcal{P}}\{v[X(t_1)] \mid X(t_0) = x\}$


If $v$ is a function which satisfies (5.3), then integrating
(5.3) (formally, this is an application of Dynkin's formula
[3, p. 133]) gives

(5.6)  $D_v^{\mathcal{P}}(x,t_0,t_1) = \gamma(t_1 - t_0) + v(x)$

and $\gamma$ is the expected long run average cost of $\mathcal{P}$.

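Furthermore if $\mathcal{P}$ is a stationary policy and $v$ is
a function such that

(5.7)  $E^{\mathcal{P}^*}\{dv + dR\} \ge \gamma\,dt = E^{\mathcal{P}}\{dv + dR\}$

for all $x$ and all policies $\mathcal{P}^*$, then
$D_v^{\mathcal{P}^*}(x,t_0,t_1) \ge \gamma(t_1 - t_0) + v(x) = D_v^{\mathcal{P}}(x,t_0,t_1)$ and $\mathcal{P}$ is
optimal for the truncated problem with terminal cost $v$. If
$\mathcal{P}^*$ is a stationary policy for which $E^{\mathcal{P}^*}\{v[X(t_1)] \mid X(t_0) = x\} = O(1)$
as $t_1 \to \infty$, then

$\gamma^* = \liminf_{t_1 \to \infty} \frac{E^{\mathcal{P}^*} C(x,t_0,t_1)}{t_1 - t_0} \ge \gamma = \lim_{t_1 \to \infty} \frac{E^{\mathcal{P}} C(x,t_0,t_1)}{t_1 - t_0}$

and $\mathcal{P}$ is optimal among the class of stationary policies which
satisfy the above restriction.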

We apply the optimality condition (5.7) to determine bounds
on $\gamma$. Suppose that for a given function $v$,

(5.8)  $\inf_{\mathcal{P}^*} E^{\mathcal{P}^*}(dv + dR) = \gamma(x)\,dt.$

Then $r$ replaced by $\hat{r} = r + \inf_x \{\gamma(x)\} - \gamma(x)$
defines a problem with an optimal expected average cost of
$\hat{\gamma} = \inf_x \gamma(x) \le \gamma$. From this argument and a similar one
involving $\sup_x \gamma(x)$, we have the bounds

(5.9)  $\inf_x \gamma(x) \le \gamma \le \sup_x \gamma(x)$

As in the deterministic case, the optimality condition (5.7)
converts to $|v_{x_2}| \le c$ on $A_0$, $v_{x_2} \ge c$ on $A_-$, and $v_{x_2} \le -c$ on $A_+$.

For a given stationary policy $\mathcal{P}$, determined by a
specified $A_0$, $A_-$, $A_+$, the potential function is a solution of the
partial differential equations (5.3'). The problem of finding
the optimal policy is related to the free boundary problem (FBP)
of solving the differential equation and finding the regions
$A_0$, $A_-$, $A_+$ for which the optimality conditions are satisfied.
In this paper we bypass this analytic problem and consider
instead a numerical approximation to the solution of the stochastic
control problem by solving a discrete bounded space, discrete time
approximation to the problem.

6.  Discrete Approximation to the Stochastic Control Problem.

(6.1)  $X_1(t+1) = X_1(t) + X_2(t)$
       $X_2(t+1) = X_2(t) + u(x(t),t) + \sigma y(t)$

where $y(t) = \pm 1$ with probability 1/2 and $u$ is the control.
If $u$ is confined to $\pm u_0$ or 0, where $u_0$ is an integer and
$\sigma$ is an integer, a point $x(t) = (x_1(t),x_2(t))$ whose coordinates
are integers will move to another such point. To bound the state
space we must confine $x(t)$ to a finite set $\mathcal{E}$ of such points,
but then $x(t+1)$ may no longer be in $\mathcal{E}$. To handle that case we
may replace each ordinary successor $x(t+1)$ of a point $x(t)$
by a suitably modified successor in $\mathcal{E}$ if $x(t+1)$ is not itself
a point in $\mathcal{E}$. As a result we will have a modified set of laws
of motion where $X(t) = x \in \mathcal{E}$ and $u$ determine a probability
distribution for $X(t+1) \in \mathcal{E}$.

We now define the cost associated with the time interval
$(t,t+1]$ to be $r(t+1) = r(x(t),u(x(t),t))$ if the ordinary
successor of $x(t)$ is in $\mathcal{E}$. Later we shall modify $r$ somewhat.
If $\mathcal{E}$ is a set containing all points in a large circle about the
origin, one would expect the problem of minimizing the expected
long run average cost for this problem to resemble that of our
continuous time continuous unbounded state problem. But this
discrete time finite space problem can be solved, and backward
induction provides a method of approximating its solution.

This program faces a few difficulties. First, since $t$
and the coordinates of $x$ change by integer values, our
approximation may be rather coarse and require refinement.
Also the values of $u_0$ and $\sigma$ may not be integers. Second, we
have not yet specified $\mathcal{E}$, the modified successor rule, nor
$r(t+1)$ if the ordinary successor is not in $\mathcal{E}$. The procedure
for specifying the successor and $r(t+1)$ will determine how
large $\mathcal{E}$ must be to reduce the edge effects of bounding the
state space. A poor choice will lead to large edge effects and
require a correspondingly large $\mathcal{E}$ to reduce these effects. But
a large $\mathcal{E}$ requires correspondingly more computing effort. Third,
backward induction is simple to implement but may require
considerable computing. It is possible to reduce the amount of computing
needed by having a good approximation to the solution and by using
acceleration techniques.

Having described the approach in principle we provide some
details. Given a discrete time finite state stationary problem
and a terminal cost $v_0(x)$, let

(6.2)  $C(x,t_0,t_1) = \sum_{t=t_0+1}^{t_1} r(t)$

represent the cost over $(t_0,t_1]$ for a procedure $\mathcal{P}$ given a starting
point $X(t_0) = x$, and let

$D_{v_0}^{\mathcal{P}}(x,t_0,t_1) = E^{\mathcal{P}}\{C(x,t_0,t_1) + v_0[X(t_1)] \mid X(t_0) = x\}$

be the expected cost for the problem with terminal cost $v_0$.
Backward induction permits one to compute both the optimal expected
cost for the problem with terminal cost $v_0$ and the optimal policy,
using the equation

(6.3)  $D_{v_0}(x,t_0,t_1) = \inf_{u = \pm u_0,\,0} E[D_{v_0}(X(t_0+1), t_0+1, t_1) + r(t_0+1) \mid X(t_0) = x, u], \qquad t_0 < t_1$
       $D_{v_0}(x,t_1,t_1) = v_0(x)$


where we recall that the distribution of $X(t_0+1)$ depends on
$u$. The conditional expectation in the above expression is the
average of 2 terms involving $y = \pm 1$. Under suitable conditions
concerning non-periodicity and ergodicity, it is possible to show
that as $t_1 - t_0 \to \infty$

(6.4)  $D_{v_0}(x,t_0,t_1) = \gamma(t_1 - t_0) + v(x) + o(1)$

where $v(x)$ is the potential function and $\gamma$ is the expected
long run average cost for the optimal long run stationary policy.
Moreover the policy $\mathcal{P}$ also converges in the sense that for
$t_1 - t_0$ sufficiently large the backward induction policy at $t_0$
coincides with the optimal long run stationary policy.
Incidentally, the closer $v_0$ is to $v$, the quicker this method
converges.

If we let $D_{v_0}(0,-n,0) - D_{v_0}(0,-(n-1),0) = \gamma_n$ and
$D_{v_0}(x,-n,0) - D_{v_0}(0,-n,0) = v_n(x)$, then $\gamma_n \to \gamma$ and
$v_n(x) \to v(x) - v(0)$. Substituting $D_{v_0}(x,-(n-1),0)$ for $v$
in a discrete version of (5.8) we have $v_n(x) - v_{n-1}(x) + \gamma_n$
in place of $\gamma(x)$. Hence bounds on $\gamma$ are provided by

(6.5)  $\gamma_n + \inf_x [v_n(x) - v_{n-1}(x)] \le \gamma \le \gamma_n + \sup_x [v_n(x) - v_{n-1}(x)].$

Finally $v_n$ satisfies the slightly simpler looking version
of (6.3) which is given by

(6.6)  $\gamma_n + v_n(x) = \inf_u E[v_{n-1}(X(t_0+1)) + r(t_0+1) \mid X(t_0) = x], \qquad n \ge 1$

where $\gamma_n$ is defined to be the right hand side of (6.6) when
$x = 0$. (Thus $v_n(0) = 0$.)

In summary, once we have a finite stationary Markov Decision
problem, the equations (6.5)-(6.6) describe how to compute
the optimal $\gamma$, $v$ and $\mathcal{P}$, and bounds on these. In the
next sections we describe some alternate approaches to handling
the difficulties listed above.
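
The recursion (6.6) and the bounds (6.5) can be sketched in a few lines. The example below is only an illustration under loudly stated assumptions: the state set is a small rectangle rather than the ellipse described later, successors that leave it are simply clipped back to its edge (a crude stand-in for the modified successor rule, not the edge treatment of this paper), the starting function is $v_0 = 0$, and the grid sizes and parameters are hypothetical.

```python
def relative_value_iteration(n1=6, n2=3, u0=1, sigma=1, c=1.0, sweeps=60):
    """Iterate (6.6) on the box |x1| <= n1, |x2| <= n2 with clipped successors.

    Returns a list of (lower bound, gamma_n, upper bound) from (6.5), one per
    sweep, and the final relative value function v_n as a dict.
    """
    states = [(x1, x2) for x1 in range(-n1, n1 + 1) for x2 in range(-n2, n2 + 1)]
    clip = lambda z, n: max(-n, min(n, z))
    v = {x: 0.0 for x in states}              # v_0 = 0 (assumption of this sketch)
    history = []
    for _ in range(sweeps):
        rhs = {}
        for (x1, x2) in states:
            best = float("inf")
            for u in (-u0, 0, u0):
                cost = c * abs(u) + x1 * x1   # running cost r(x, u)
                expected = 0.0
                for y in (-1, 1):             # y = +1 or -1 with probability 1/2
                    nxt = (clip(x1 + x2, n1), clip(x2 + u + sigma * y, n2))
                    expected += 0.5 * v[nxt]
                best = min(best, cost + expected)
            rhs[(x1, x2)] = best
        gamma_n = rhs[(0, 0)]                 # right side of (6.6) at x = 0
        v_new = {x: rhs[x] - gamma_n for x in states}
        diffs = [v_new[x] - v[x] for x in states]
        history.append((gamma_n + min(diffs), gamma_n, gamma_n + max(diffs)))
        v = v_new
    return history, v


if __name__ == "__main__":
    history, v = relative_value_iteration()
    for lower, gamma_n, upper in history[-3:]:
        print(lower, gamma_n, upper)
```

Each sweep yields an interval from (6.5) for the average cost of this toy problem; the lower ends never decrease and the upper ends never increase from one sweep to the next, which gives a natural stopping rule.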

7.  Refinement  of  Grid 

In Section 4 we discussed the transformation which leaves
the problem invariant except for changes in the parameters $u_0$,
$\sigma$ and $c$. If $u_0$ and $\sigma$ are replaced by $u_0^*$ and $\sigma^*$,
$c$ is replaced by $c^*$ and changes of 1 in $x_1^*$, $x_2^*$
and $t^*$ correspond to changes of $a_1^{-1}$, $a_2^{-1}$ and $a_3^{-1}$
in $x_1$, $x_2$ and $t$. Thus by taking $\sigma^*$ and $u_0^*$ to be integers
so that $\sigma^*/u_0^*$ is large, we have fine grids in all 3 scales.
For example, if $\sigma = u_0 = 1$ and $\sigma^* = 2$ and $u_0^* = 1$, the
changes of 1 in $x_1^*$, $x_2^*$ and $t^*$ correspond to changes
of 1/16, 1/4 and 1/4 in $x_1$, $x_2$ and $t$.

8.  The Finite Set $\mathcal{E}$ of States

If we regard $\mathcal{E}$ as representing a region in the $(x_1,x_2)$
space, some point near the boundary will have a tendency to
be followed by a successor not in $\mathcal{E}$. In the infinite set
discrete space version of our problem such a point will tend
to trace out a path which may be regarded as a temporary
excursion from $\mathcal{E}$. Our strategy is to select $\mathcal{E}$ so that
a good deterministic policy would lead to relatively short
excursions. It seems intuitively clear that this strategy
would be helpful in minimizing the edge effects and making
it easy to cope with them. Informal analysis suggested
an ellipse centered at the origin which would have its major
axis go roughly through the points $(s^2/2u_0, -s)$ and
$(-s^2/2u_0, s)$, where $s$ is a size parameter and the designated
points are on the parabola leading to 0 in a reasonable
suboptimal policy for the deterministic problem. Indeed, the
ellipse selected is

$\frac{x_1^2}{\alpha_1^2} + \frac{2\rho\,x_1 x_2}{\alpha_1 \alpha_2} + \frac{x_2^2}{\alpha_2^2} = 1$

where $\alpha_1 = 1.45\,s^2/(4u_0)$, $\alpha_2 = 1.7\,s/2$, and

$\rho = \frac{(1.45)(1.7)}{8} \left[\frac{4}{(1.45)^2} + \frac{4}{(1.7)^2} - 1\right] \approx .705$

This ellipse goes through the points indicated above. Note
that the area inside the ellipse is proportional to $s^3$.
Thus as $s$ increases and the grid becomes refined, the number of
points in $\mathcal{E}$ grows rapidly.

Related to the lattice points inside the ellipse, a set of points $\mathcal{B}$ called the boundary is identified. These are the possible successors $(x_1 + x_2,\ x_2 + u_0 i + \sigma j)$ (where $i = \pm 1$ or 0 and $j = \pm 1$), which are not in $\mathcal{E}$.
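The construction of the two point sets can be sketched as follows. This is a minimal illustration, assuming the unit grid ($u_0 = \sigma = 1$ with $u_0^* = \sigma^* = 1$) so that lattice coordinates and original coordinates coincide, and assuming that points on the ellipse itself count as inside. The names `ellipse_params`, `q` and `build_sets` are hypothetical, and the cross coefficient is computed from the requirement that the ellipse pass through $(s^2/2u_0, -s)$.

```python
import math


def ellipse_params(s, u0):
    """Scale constants a1, a2 and cross coefficient rho for size s."""
    a1 = 1.45 * s ** 2 / (4 * u0)
    a2 = 1.7 * s / 2
    rho = (1.45 * 1.7 / 8) * (4 / 1.45 ** 2 + 4 / 1.7 ** 2 - 1)
    return a1, a2, rho


def q(x1, x2, s, u0):
    """Left-hand side of the ellipse equation; q <= 1 means inside."""
    a1, a2, rho = ellipse_params(s, u0)
    p, r = x1 / a1, x2 / a2
    return p * p + 2 * rho * p * r + r * r


def build_sets(s, u0=1, sigma=1):
    """Lattice points inside the ellipse and their successors outside it."""
    a1, a2, rho = ellipse_params(s, u0)
    stretch = 1 / math.sqrt(1 - rho ** 2)      # bounding-box factor
    n1 = math.ceil(a1 * stretch)
    n2 = math.ceil(a2 * stretch)
    inside = {(x1, x2)
              for x1 in range(-n1, n1 + 1)
              for x2 in range(-n2, n2 + 1)
              if q(x1, x2, s, u0) <= 1}
    boundary = set()
    for x1, x2 in inside:
        for i in (-1, 0, 1):
            for j in (-1, 1):
                y = (x1 + x2, x2 + u0 * i + sigma * j)
                if y not in inside:
                    boundary.add(y)
    return inside, boundary


E, B = build_sets(s=4)
print(len(E), len(B))
```

The counts printed here need not match the figures in Table 5, since the report does not state how its lattice points were stored or counted.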

9. Edge Effects Adjustment

An initial conjecture that subsequently proved wrong was that differences in the (optimal) potential function for two successive states far from the origin would resemble the differences in the total cost function for the optimal policy in the deterministic problem and that these differences in turn would resemble those for the total cost function for one of the suboptimal policies in the deterministic problem. This conjecture led to the following choice of $v_0$ and edge effect adjustment. The function $v_0$ was the total cost function for the relatively easily computed policy which was described in Section 2.

The edge effect adjustment may be described as follows. We shall first present a policy which resembles a good policy for the deterministic problem when $x$ is far from the origin. In the unbounded discrete space version of the deterministic problem, this policy would lead from a point $y \in \mathcal{B}$ (the boundary) to an excursion which ultimately returns to some reentry point $x^* \in \mathcal{E}$ after $m(y)$ steps. We shall act as though

(9.1)  $v_n(y) = v_n(x^*) + v_0(y) - v_0(x^*) - m(y)\gamma_n$.

This permits one to apply equation (6.6) even for points $x \in \mathcal{E}$ with possible successors in $\mathcal{B}$. In effect we are simply replacing the possible successors $y$ of $x$ by $x^* \in \mathcal{E}$ and changing the cost of using $u$ when $X(t) = x$.

Here $v_0(y) - v_0(x^*)$ is an estimate of the cumulated cost of moving from $y$ to $x^*$ and $m(y)\gamma_n$ is a correction for the average cost of taking $m(y)$ steps. The latter correction is based on a natural interpretation of (6.6) after transposing $\gamma$ to the right hand side. The above mentioned policy used to determine the excursion and the reentry point $x^*$ is to use $u = -u_0$ as long as $x$ is such that $x_2 > 0$ and $x_1 \ge -x_2^2/(2u_0)$, or $x_2 < 0$ and $x_1 > x_2^2/(2u_0)$, unless this gives a successor $(x_1 + x_2, x_2 - u_0)$ which overshoots the parabola; i.e., $x_2 < 0$ and $x_1 + x_2 < (x_2 - u_0)^2/(2u_0)$. In that case use $u = 0$ instead of $u = -u_0$. This covers half of the $x$ space and the other half is treated symmetrically.
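A deterministic excursion under this policy can be traced step by step. The sketch below is illustrative only: it assumes the unit grid with $u_0 = 1$, takes the overshoot test to be $x_1 + x_2 < (x_2 - u_0)^2/(2u_0)$ applied when $x_2 < 0$, assigns points with $x_2 = 0$ and $x_1 > 0$ to the $u = -u_0$ half, and accumulates the running cost $x_1^2 + c|u|$ of (1.3). None of the function names come from the report.

```python
def ellipse_q(x1, x2, s, u0):
    """Left-hand side of the ellipse equation; <= 1 means inside."""
    a1 = 1.45 * s ** 2 / (4 * u0)
    a2 = 1.7 * s / 2
    rho = (1.45 * 1.7 / 8) * (4 / 1.45 ** 2 + 4 / 1.7 ** 2 - 1)
    p, r = x1 / a1, x2 / a2
    return p * p + 2 * rho * p * r + r * r


def policy(x1, x2, u0):
    """Control in {-u0, 0, +u0} for the suboptimal deterministic policy."""
    if x1 == 0 and x2 == 0:
        return 0
    in_half = ((x2 > 0 and x1 >= -x2 ** 2 / (2 * u0))
               or (x2 < 0 and x1 > x2 ** 2 / (2 * u0))
               or (x2 == 0 and x1 > 0))      # tie rule: an assumption
    if not in_half:
        return -policy(-x1, -x2, u0)         # other half, by symmetry
    y1, y2 = x1 + x2, x2 - u0                # successor under u = -u0
    if x2 < 0 and y1 < y2 ** 2 / (2 * u0):   # assumed overshoot test
        return 0
    return -u0


def excursion(y, s, u0=1, c=1, max_steps=10_000):
    """Follow the policy from y until the ellipse is re-entered.

    Returns (reentry point, number of steps, cumulated cost).
    """
    x1, x2 = y
    cost, steps = 0, 0
    while ellipse_q(x1, x2, s, u0) > 1:
        if steps >= max_steps:
            raise RuntimeError("excursion did not re-enter the ellipse")
        u = policy(x1, x2, u0)
        cost += x1 ** 2 + c * abs(u)
        x1, x2 = x1 + x2, x2 + u
        steps += 1
    return (x1, x2), steps, cost


x_star, m, cost = excursion((10, 0), s=4)
print(x_star, m, cost)
```

For the starting point $(10, 0)$ and $s = 4$ the path is $(10,0) \to (10,-1) \to (9,-2) \to (7,-3)$, so the reentry point is $(7,-3)$ after $m = 3$ steps.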

At this point we comment that instead of using $r(x,u) = x_1^2 + c|u|$ we use $x_1^2 + x_1 x_2 + x_2^2/3 + c|u|$ on the ground that if $x(0) = x$ and $dx_1 = x_2\,dt$, then $\int_0^1 x_1(t')^2\,dt' = x_1^2 + x_1 x_2 + x_2^2/3$. Thus we would expect the revised $r(x,u)$ to better reflect the cost of the continuous time problem than the original $r(x,u)$.


Differences in $v_0$ between successive points tend to underestimate the corresponding differences in the potential function for the less favorable stochastic problem. Hence this edge effect adjustment tends to lead to a solution with a lower value of $\gamma$ than that of the infinite discrete stochastic problem. Thus as the size $s$ increases the corresponding values of $\gamma$ increase. For example with $u_0 = \sigma = c = 1$, we have $\gamma$ as a function of $s$ in Table 1.

A slight improvement is obtained if $v_0(y) - v_0(x^*)$ in (9.1) is replaced by the sum of the $r(x,u)$ incurred over the excursion from $y$ to $x^*$. This replacement yields a better indication of what the edge effect in the discrete approximation should be for larger $s$. The improvement is reflected in that for a given value of $s$, the resulting long run expected average cost $\gamma$ for the revised version is slightly closer to the limiting value. See Table 1.

One would expect the values of $\gamma$ to decrease as we use more refined grids. Indeed Table 2 presents some values of $\gamma$ for 3 grid size parameters which are progressively more refined. The latter require many more points for a given $s$ and only relatively small values of $s$ were treated in preliminary calculations. In later calculations, the expectation of decreasing $\gamma$ is not realized. A possible explanation is conjectured in conclusion (c) in Section 11.


Note that we must deal with a triple limit. The number of iterations $n$ must $\to \infty$, the size of the ellipse $s$ must $\to \infty$ and the grid sizes must $\to 0$.

Several alternatives were employed to improve the edge effect performance so that good approximations could be achieved without an undue computing burden. These are described in the next Section.


10.  Alternative  Edge  Effect  Adjustments 

Two major alternative approaches were used. One was to start with coarse grids with large size ellipses. From the iterations in this case, $v$ is estimated by interpolation inside the ellipse. These estimated values of $v$ were used to help estimate edge effect adjustments in later computations with finer grids and smaller ellipses. The second approach was to simulate random excursions from $\mathcal{E}$ to estimate $v$ on the boundary. We present more detail below.

(a)  Interpolation 

Suppose that the approach of Section 9 has been applied for a certain grid and a large ellipse of size $s_1$. After a number of iterations, good estimates of $\gamma$ and $v$ and the optimal policy are obtained for the corresponding discrete approximation to our problem. Select a finer grid (by choosing new values of $u_0^*$ and $\sigma^*$) and a correspondingly smaller ellipse $\mathcal{E}_2$ of size $s_2$ to keep the number of points in $\mathcal{E}_2$ within bounds. By interpolation from $v$ compute $v^*$, the estimated values of $v$ on the new grid in the new ellipse $\mathcal{E}_2$ and on its boundary $\mathcal{B}_2$. For each point $x \in \mathcal{E}_2$ which has a possible successor $y \in \mathcal{B}_2$ define $d(x,y) = v^*(y) - v^*(x)$. Hereafter, in doing the backward induction the term $v_{n-1}(y)$ in the computation of $E\{v_{n-1}(X(t+1)) \mid X(t) = x\}$ is replaced by $v_{n-1}(x) + d(x,y)$. With this treatment of the edge effect, apply backward induction until good estimates of $\gamma$ and $v$ are obtained for the new grid size. This process of refinement of grid and reduction of size can be repeated using the interpolation technique.
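The computation of the differences $d(x,y)$ from a coarse-grid potential can be sketched as follows. The report does not specify the interpolation formula beyond mentioning linear or quadratic interpolation, so bilinear interpolation is used here purely as an assumption. The coarse values in the usage lines are a made-up toy function, and the names `bilinear` and `edge_differences` are hypothetical.

```python
import math


def bilinear(v, p1, p2):
    """Bilinear interpolation of lattice values v[(i, j)] at (p1, p2).

    The four surrounding lattice points must be present in v.
    """
    i, j = math.floor(p1), math.floor(p2)
    f1, f2 = p1 - i, p2 - j
    return ((1 - f1) * (1 - f2) * v[(i, j)]
            + f1 * (1 - f2) * v[(i + 1, j)]
            + (1 - f1) * f2 * v[(i, j + 1)]
            + f1 * f2 * v[(i + 1, j + 1)])


def edge_differences(v_coarse, ratio1, ratio2, pairs):
    """d(x, y) = v*(y) - v*(x) for (x, y) pairs given in fine-grid units.

    ratio1, ratio2: size of one fine-grid step in x1 and x2, measured in
    coarse-grid steps (for example 1/16 and 1/4).
    """
    def v_star(pt):
        return bilinear(v_coarse, pt[0] * ratio1, pt[1] * ratio2)

    return {(x, y): v_star(y) - v_star(x) for x, y in pairs}


# Toy coarse-grid values (hypothetical, not taken from the report).
v_coarse = {(i, j): float(i * i + j)
            for i in range(-5, 6) for j in range(-5, 6)}
x = (16, 4)     # fine-grid point; coarse coordinates (1, 1)
y = (20, 5)     # successor of x for i = -1, j = +1 when u0* = 1, sigma* = 2
d = edge_differences(v_coarse, 1 / 16, 1 / 4, [(x, y)])
print(d[(x, y)])
```

In the backward induction, `d[(x, y)]` would then be added to the current value at `x` in place of the unavailable value at `y`.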

Simply reducing the size of the ellipse without changing the grid size shows how stable the method is. We find for example that with $u_0 = c = \sigma = 1$ and $s = 12.0$, $\gamma = 9.2626$. Successive reductions in sizes from $s = 12.0$ to $s = 6.4$ and $s = 2.4$, without refinement in grid, lead to estimates which require no changes of $\gamma$ and $v$ through further iteration.

Such excellent results cannot be expected when the grid is refined, for then the refined discrete problem should have a somewhat different answer depending on how coarse the original grid was. Another potential difficulty is that the interpolation process is not very accurate for $v$ on a coarse grid. Indeed the behavior of $v$ as $x_2$ changes is difficult to approximate well by linear or quadratic interpolation. Although the estimated values within $\mathcal{E}$ are not crucial, the values of $d(x,y)$ are very important in determining the limiting values of $\gamma$ and $v$, especially when the size $s$ of the ellipse is small.

In this reduction and interpolation approach the Markov Decision Problem with finite state space $\mathcal{E}_2$ has been replaced by a new problem where $\mathcal{E}_2$ is augmented by the points $y \in \mathcal{B}_2$. However if $y \in \mathcal{B}_2$ is a successor of two distinct $x \in \mathcal{E}_2$, the augmented state space must treat $y$ as two points, else there will be a discrepancy due to the fact that $v_{n-1}(x) + d(x,y)$ may not coincide for both $x$. In this related problem the equation $v_{n-1}(y) = v_{n-1}(x) + d(x,y)$ implies a motion from $y$ to its predecessor $x$ with a related cost $d(x,y) + \gamma_n$ which is almost stationary.

(b) Random Excursion.

The random excursion edge effect adjustment differs from the adjustment described in Section 9. There we modeled a problem which is essentially one where points outside the ellipse travel without the influence of random forces, and are subjected to the suboptimal deterministic policy. This problem underestimates the cost of excursions from $\mathcal{E}$. A more realistic estimate can be obtained by simulating the motion with the random forces until the point which has left $\mathcal{E}$ returns to $\mathcal{E}$. To do so, a point which leaves $\mathcal{E}$ from $x$ moves to $y \in \mathcal{B}$ and from $y$ to $y' = (y_1 + y_2,\ y_2 + u + \sigma w)$ where $u$ is selected to be $u_0$, 0 or $-u_0$ according to the optimal deterministic policy and $w$ is selected to be +1 or -1 with probability 1/2 by use of a random number generator. From $y'$ the point moves on until it ultimately returns to $\mathcal{E}$ at a point $x^{(1)}$ after incurring a cumulated cost $c^{(1)}(y)$ in $n^{(1)}(y)$ steps from $y$. This process is repeated $m$ times. Hereafter $v_n(y)$ is replaced by

$$\frac{1}{m}\sum_{i=1}^{m}\left\{ v_n(x^{(i)}) + c^{(i)}(y) - n^{(i)}(y)\,\gamma_n \right\}$$

In effect we have a Markov decision problem where the state $y$ has been replaced by $m$ equally likely states in $\mathcal{E}$ with appropriate costs attached to the motion required to reenter $\mathcal{E}$.
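The random excursion estimate can be sketched in a few lines. This is an illustration under stated assumptions rather than the program used for the report: the unit grid with $u_0 = \sigma = 1$, the running cost $x_1^2 + c|u|$, the deterministic policy with an assumed overshoot test and tie rule at $x_2 = 0$, and a caller-supplied function `v` standing in for the current potential $v_n$ inside the ellipse. All names are hypothetical.

```python
import random


def ellipse_q(x1, x2, s, u0):
    """Left-hand side of the ellipse equation; <= 1 means inside."""
    a1 = 1.45 * s ** 2 / (4 * u0)
    a2 = 1.7 * s / 2
    rho = (1.45 * 1.7 / 8) * (4 / 1.45 ** 2 + 4 / 1.7 ** 2 - 1)
    p, r = x1 / a1, x2 / a2
    return p * p + 2 * rho * p * r + r * r


def policy(x1, x2, u0):
    """Deterministic policy; overshoot test and tie rule are assumptions."""
    if x1 == 0 and x2 == 0:
        return 0
    in_half = ((x2 > 0 and x1 >= -x2 ** 2 / (2 * u0))
               or (x2 < 0 and x1 > x2 ** 2 / (2 * u0))
               or (x2 == 0 and x1 > 0))
    if not in_half:
        return -policy(-x1, -x2, u0)
    if x2 < 0 and x1 + x2 < (x2 - u0) ** 2 / (2 * u0):
        return 0
    return -u0


def random_excursion(y, s, rng, u0=1, sigma=1, c=1, max_steps=100_000):
    """Simulate from y (outside the ellipse) until re-entry.

    Returns (reentry point, cumulated cost, number of steps).
    """
    x1, x2 = y
    cost = 0
    for steps in range(1, max_steps + 1):
        u = policy(x1, x2, u0)
        w = rng.choice((-1, 1))
        cost += x1 ** 2 + c * abs(u)
        x1, x2 = x1 + x2, x2 + u + sigma * w
        if ellipse_q(x1, x2, s, u0) <= 1:
            return (x1, x2), cost, steps
    raise RuntimeError("excursion did not re-enter the ellipse")


def boundary_value(y, v, gamma, s, m, rng):
    """Average of v(x_i) + c_i(y) - n_i(y) * gamma over m excursions."""
    total = 0.0
    for _ in range(m):
        x_re, cost, steps = random_excursion(y, s, rng)
        total += v(x_re) + cost - steps * gamma
    return total / m


rng = random.Random(0)
print(boundary_value((10, 0), lambda x: 0.0, 0.0, 4, 20, rng))
```

The estimate changes with the seed, which is the simulation-to-simulation fluctuation the text describes.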

This edge effect treatment tends to overestimate the cost of excursions in the discrete problem since the policy followed outside $\mathcal{E}$ is suboptimal. This tendency is relatively slight and good estimates of $\gamma$ can be obtained for smaller ellipses than those used in Section 9. However these estimates are affected by the random process used in the simulation and fluctuate from simulation to simulation. Table 4 presents the results of several such simulations. This indicates that the simulation technique is quite effective when $s$ is as large as 6. For smaller $s$, there is a great deal of variability and the fact that the excursions are guided by a suboptimal policy introduces a positive bias.


An elaboration of this approach is to apply the method of importance sampling. Here one decides for each point which value of $w$, plus or minus one, is most favorable in the sense that it more resembles the optimal deterministic policy. The favorable value of $w$ is selected with probability less than 1/2. This biased sampling procedure gives distorted estimates which are easily compensated for by the methods of importance sampling. To date limited experimentation with importance sampling has not shown great improvement over the simpler simulation although such techniques are often good for reducing variance.
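The compensation for biased sampling can be illustrated on a toy functional of the signs $w$. In this sketch each sign is drawn with an unequal probability and the sample is weighted by the likelihood ratio of the fair coin to the biased one, which keeps the estimate unbiased. The functional `f`, the rule `favorable` and the probability 0.4 are made-up illustrations, not values from the report.

```python
import random


def weighted_sample(k, favorable, p_fav, rng):
    """Draw w_1..w_k with P(w_i = favorable value) = p_fav.

    Returns the signs and the likelihood ratio (fair / biased) of the path.
    """
    w, weight = [], 1.0
    for _ in range(k):
        fav = favorable(w)              # favorable sign given the path so far
        if rng.random() < p_fav:
            w.append(fav)
            weight *= 0.5 / p_fav
        else:
            w.append(-fav)
            weight *= 0.5 / (1 - p_fav)
    return w, weight


def estimate(f, k, favorable, p_fav, n, rng):
    """Importance-sampling estimate of E[f(w)] under fair +1/-1 signs."""
    total = 0.0
    for _ in range(n):
        w, weight = weighted_sample(k, favorable, p_fav, rng)
        total += weight * f(w)
    return total / n


rng = random.Random(0)
# Under fair signs E[(w_1 + ... + w_4)^2] = 4.
print(estimate(lambda w: sum(w) ** 2, 4, lambda path: 1, 0.4, 20000, rng))
```

In the excursion setting, `f` would be the cost-plus-potential term of one excursion and `favorable` would depend on the current state.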

The combined effect of reducing size followed by random excursion simulation was tried with a reduction from $s = 10$ with $u_0^* = \sigma^* = 1$ to $s = 3$ and 4 with $u_0^* = 2$, $\sigma^* = 3$. The results were variable and were not as good as using the simpler reduction plan. The conclusion seems to be that random excursion simulations should be avoided unless $s$ is large (greater than 6 for $u_0 = \sigma = c = 1$).


Miscellaneous  Remarks 


(a) Acceleration Techniques

The study of successive values of $\gamma_n$ indicates that after an early period of major adjustments $\gamma_n$ tends to fluctuate periodically about the limiting value. In particular for $c = u_0 = \sigma = 1$, successive values alternate below and above the limit. In that case the occasional replacement of $v_n(x)$ by $[v_n(x) + v_{n-1}(x)]/2$ and $\gamma_n$ by $(\gamma_n + \gamma_{n-1})/2$ accelerates the convergence. On other occasions the use of ... and a similar operation on $v_n$ proves helpful. In cases where $\gamma_n$ seems to be increasing steadily by small almost equal increments the occasional use of $\gamma_n + \ldots$ speeds up convergence. Without these acceleration techniques the case of $u_0 = c = \sigma = 1$, $s = 9.0$ required $n = 150$ to converge to the point where $\sup_x |v_n(x) - v_{n-1}(x)| < .00024$. With 3 applications of these simple acceleration techniques, only 70 iterations were needed to obtain this result. These averaging methods are related to what Kushner and Kleinman call the accelerated Jacobi Method [8].
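The effect of occasional averaging on an alternating sequence can be seen on a scalar toy iteration. The map $x \to 1 - 0.9x$ below is a stand-in chosen only because its iterates alternate about the limit. It is not the backward induction of the report, and the schedule of averaging every tenth step is an arbitrary assumption.

```python
def iterate(f, x0, tol, average_every=None, max_iter=10_000):
    """Iterate x <- f(x) until successive values differ by less than tol.

    If average_every is set, every average_every-th step replaces the
    current value by the mean of the last two values instead of applying f.
    Returns (final value, number of steps taken).
    """
    x_prev, x = x0, f(x0)
    n = 1
    while abs(x - x_prev) >= tol and n < max_iter:
        if average_every and n % average_every == 0:
            x_prev, x = x, (x + x_prev) / 2
        else:
            x_prev, x = x, f(x)
        n += 1
    return x, n


def f(x):
    """Toy map whose iterates alternate about the fixed point 1/1.9."""
    return 1.0 - 0.9 * x


plain, n_plain = iterate(f, 0.0, 1e-6)
fast, n_fast = iterate(f, 0.0, 1e-6, average_every=10)
print(n_plain, n_fast)
```

Because successive errors have opposite signs and nearly equal size, their mean cancels most of the error, which is why a few averaging steps shorten the run.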

(b) Evaluation of Suboptimal Policies.

A major function of finding and evaluating optimal policies is to decide whether a convenient or simple suboptimal policy is relatively efficient. To do so one must also be able to evaluate a specified suboptimal policy. The general policies described in this paper are easily adapted to the problem of evaluating a specified stationary policy. The fundamental equation (6.6) is changed so that the infimum is omitted and the distribution of $X(t+1)$ is governed by the specified policy.

(c) The Method of Kushner and Kleinman

Kushner and Kleinman [7] and Kushner [6] present an approach very similar to that of this paper in a related problem. However their treatment of the edge effect was simpler and somewhat less inclined to give good approximations for a given size region. First the region is rectangular rather than elliptical. Second, points on the boundary of the rectangle are constrained to move along the boundary when they do not reenter naturally. In effect the boundary acts much like a reflecting barrier in the treatment of Kushner and Kleinman. This is a less realistic model than that obtained from our treatment of simulated deterministic or random excursions.

(d) Rigorous Treatment.

A more rigorous treatment of the original continuous time continuous unbounded state space problem involves several difficulties. One is measure theoretic in nature but that seems to be subject to treatment. For example see Yamada [9]. Another problem is that caused by the unboundedness of the state space and the potential function. This seems to be a more fundamental difficulty even in discrete space problems. Recent approaches to these problems have been made by Bather [2] and by Hordijk, Schweitzer and Tijms [ ].

11. Computation

After a series of experiments with various approaches, one long computer run was executed in an interactive mode. This run applied a sequence of reductions in size $s$ combined with interpolations to more refined grid sizes. Table 5 outlines some of the details of this run carried out for $u_0 = \sigma = c = 1.0$.

In more detail, the initial approximation to $v$ is $v_0$ derived from the suboptimal deterministic policy of Section 2. The results of the successive stages were potential functions labeled $v(s, u_0^*, \sigma^*)$ each of which was interpolated to serve as an initial approximation for the next stage. One should recall that the essential aspect of these approximations is their influence on the edge effect. Changing the approximation inside the ellipse only affects the speed of convergence, and that only to a relatively minor degree.

Table 6 contains these estimates of the potential function and also $v(10,1,1)$ or $v(16,1,1)$. These are based on one stage starting from $v_0$. The latter, $v(16,1,1)$, was inserted when it was available and $v(10,1,1)$ was not. The table has a considerable number of gaps due partly to incomplete print out detail but mainly to essential unavailability. However, the data, as presented, permit many comparisons to be made and from these a reasonable picture of the nature of the potential functions and the edge effects may be recovered.

Figure 2 presents an approximation to $A_0$ derived from this run. Figure 3 presents, on a larger scale, the coarse approximation to $A_0$ derived from the first stage of the run. Several conclusions may be drawn.

(a) More interesting figures for $A_0$ would have resulted if a larger value of $c$ had been selected. Then the $A_0$ region would have been larger and the essential coarseness of the grid would have been relatively less important.

(b) A few iterations provide a good estimate of $A_0$. A coarse grid and an ellipse with relatively few points provide a good rough approximation to $\gamma$ and $v$ with little computing effort. Refinement by this backward induction technique quickly becomes very expensive.

(c) At first, refinement of grid size provides the controller more opportunity to control properly and $\gamma$ is reduced. Subsequently another effect tends to increase $\gamma$ in later stages. While inaccurate interpolation may contribute a deleterious edge effect which raises $\gamma$, we conjecture that another effect is more important. The Brownian Motion was modeled by a random variable which takes on values $\pm 1$. This model for a short time grid interval more nearly resembles Brownian Motion over a unit time interval and hence has a higher fourth moment. Thus the $\pm 1$ model for coarse grids has a tendency to reduce the resulting $\gamma$ in comparison with the Brownian Motion model. An additional byproduct of this effect seems to be that somewhat larger values of $s$ are required as the grid becomes refined.
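The fourth moment argument in conclusion (c) can be made concrete. If a unit time interval is covered by $k$ grid steps, the $\pm 1$ model replaces the Brownian increment over that interval by a normalised sum of $k$ independent signs. The sketch below computes the fourth moment of that sum exactly. The identification of $k$ with the number of steps per unit time (1, 4, 9 and 16 for the $\Delta t$ values of Table 5) is an interpretation offered here, not a statement from the report.

```python
from fractions import Fraction
from math import comb


def fourth_moment(k):
    """E[S^4] for S = (e_1 + ... + e_k)/sqrt(k), e_i = +1 or -1 equally likely."""
    total = Fraction(0)
    for plus in range(k + 1):
        s = 2 * plus - k                     # un-normalised sum of the signs
        total += Fraction(comb(k, plus), 2 ** k) * s ** 4
    return total / k ** 2


for k in (1, 4, 9, 16):
    print(k, fourth_moment(k))
```

The result equals $3 - 2/k$: it is 1 for a single step and rises toward 3, the fourth moment of a standard normal variable, as the time grid is refined.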
Table 1


Long run expected average costs $\gamma$ and $\gamma$ for optimal policy in discrete approximation problems as a function of the size parameter $s$


Long run expected average cost $\gamma$ for optimal policy in discrete approximation problem as a function of size and grid parameters $(u_0^*, \sigma^*)$


Limiting values $\gamma$ without reduction and $\gamma$ after reduction from $s = 10$; on samples of 9 trials with 40 excursions


Table 5


Details of Computer Run $u_0 = \sigma = c = 1.0$


$i$    $s$    $u_0^*$   $\sigma^*$   $\Delta x_1$   $\Delta x_2$   $\Delta t$   $n(\mathcal{E})$   $n(\mathcal{B})$   $\gamma$

 1    20.0      1          1          1.0000         1.0000        1.0000        5,455               928            9.263
 2     9.0      1          2          0.0625         0.2500        0.2500       31,863             4,243            7.254
 3     4.0      1          3          0.0123         0.1111        0.1111       31,863             5,463            7.471
 4     2.2      1          4          0.0039         0.0625        0.0625       29,788             6,403            7.597

$i$ = stage; $s$ = size; $\Delta x_1$, $\Delta x_2$, $\Delta t$ are grid sizes in $x_1$, $x_2$, $t$ scales; $n(\mathcal{E})$ and $n(\mathcal{B})$ are the number of lattice points in the $i$-th ellipse and $i$-th boundary. $\gamma$ = the long run average cost for the $i$-th stage approximation to the continuous optimization problem.


Table 6

Approximations to Potential Function
$u_0 = \sigma = c = 1$


8829 

20964 

20624* 

2105  5789 

6027 

6393 

6649* 

6911 

5569 

285  1116 

330  1201 

1 379 

1319 

1416* 

1518 

1286 

13  98 

20  117 

28 

124  98 

145*  117 

167 

124 

5 40 

8 51 

14 

53*  41 

69*  52 

86* 

41 

1.7  12.4 

3.5  18.1 

7 

9.3  12.9 

19.2  18.6 

33 

9.3  13.2 

18.8 

0.6  1.8 

1.5  . 4.4 

3.3 

2.3*  2.3 

8.5*  4.7 

18.9* 

2.3 

4.3 

0.0  0.0 

0.6  0.9 

1.8 

0.0  0.0 

2.2*  0.6 

8.9 

.0.0  0.0 

0.6 

0.6  lie 

0.6  0.9 

1.2 

-8.5*  2.3 

-9.4*  1.4 

-5.6* 

2.3 

1.5 

1.7  12.4 

1.1  8.8 

1.3 

9.3  12.9 

3.3*  9.3 

1.3 

9.3  13.2 

9.5 

5 40 

3 31 

2 

38*  41 

26*  32 

14* 

41 

33 

13  98 

8 82 

5 

124  98 

102*  83 

80 

124 

83 

285  1116 

244  1037 

206 

1319 

1226* 

1137 

1286 

2105  5789 

5537 

6393 

6144* 

5902 

5569 

8829 

20964 

20624* 

6530 

1 : 

1 

6798  1 

7181* 

7460 

1 

i 

433 

1387 

491 

1487  1 

1627* 

1743 

1 

40 

163 

53 

190  i 

195* 

162 

228 

180  1 

20 

81 

29 

100 

107* 

81 

131* 

99 

11 

36 

17 

i 

49  > 

49 

36 

67 

« i: 

6 

15 

10 

22 

32* 

15 

49* 

22 

3.7 

5.7 

6 

11 

19.3* 

6.1 

34 

11  1 

6.1 

1 

2.6 

4.4 

4.6 

7.9  1 

2.2* 

4.4 

14.2* 

7.9  1 

4.4 

7.9 

6.8 

3.7 

8.5 

4.1* 

7.5 

11.3 

9.3 

7.8 

M 

2 

21 

3 

19  i 

9* 

22 

9* 

20  < 1 

23 

21 

3 

58 

4 

51 

65* 

60 

• 55 

52  ■ : ; 

60 

53  1 : 

173 

891 

143 

826 

1055* 

5070 

979 

. ! 

48S2' 

5648* 

5441 

( 

Interpolated value

v(16,1,1) in place of v(10,1,1)


Table 6 (continued)


Approximations to Potential Function
$u_0 = \sigma = c = 1$

v(9,1,2)

v(20,1,1)

v(4,1,3)

v(10,1,1)

v(2.2,1,4)

4.0 

8.0 

12.0 

16.0 

20.0

I 

11844 

' 15390 

J 

19484 

" 1 

1 

24138 

29367 

CD 

e 

o 

25458 

30544 

j 

36232 

1 

42541 

49474 

25036' 

29884' 

: 

35419* 

j 

41265' 

47926' 

3464 

7929 

5231 

7425 

1 

10062 

13160 

6.0 

8640 

7267 

11338' 

1 

1 

14532' 

18230' 

22406' 

764 

1938 

1531 

1 

1 

2609 

4662 

4020 

5785 

4.0 

2251 

3524 

j 

5162 

i 

1 

7204 

9652 

2184 

3391 

♦ 

4899 

1 

6684 

9652' 

132 

333 

426 

785  ; 

928 

1484 

1666  2460 

2663 

3746 

2.0 

393 

919 

1699 

4121 

393 

919 

1697 

2735 

4100

FIGURE 2. APPROXIMATION TO $A_0$ BASED ON COMPOSITE CALCULATION